
    Normalization of oligonucleotide arrays based on the least-variant set of genes

    BACKGROUND: It is well known that the normalization step of microarray data makes a difference in the downstream analysis. All normalization methods rely on certain assumptions, so differences in results can be traced to different sensitivities to violation of those assumptions. Illustrating the lack of robustness, in a striking spike-in experiment all existing normalization methods fail because of an imbalance between up- and down-regulated genes. This means it is still important to develop a normalization method that is robust against violations of the standard assumptions. RESULTS: We develop a new algorithm based on identification of the least-variant set (LVS) of genes across the arrays. The array-to-array variation is evaluated in a robust linear model fit of pre-normalized probe-level data. These genes are then used as a reference set for a non-linear normalization. The method is applicable to any existing expression summary, such as MAS5 or RMA. CONCLUSION: We show that LVS normalization outperforms other normalization methods when the standard assumptions are not satisfied. In the complex spike-in study, LVS performs similarly to the ideal (in practice unknown) housekeeping-gene normalization. An R package called lvs is available at http://www.meb.ki.se/~yudpaw.
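
    A minimal sketch of the normalization idea, assuming a genes-by-arrays matrix of pre-computed log2 expression summaries (e.g. RMA). The published method ranks genes by array-to-array variation from a robust linear model fit of probe-level data and is distributed as the R package lvs; in the Python sketch below that ranking is approximated by a simple median absolute deviation, and the lvs_fraction and lowess span are illustrative parameters rather than the authors' defaults.

```python
# Minimal sketch of an LVS-style normalization.  The MAD-based ranking and
# the parameter defaults are assumptions for illustration only; the actual
# method derives the least-variant set from a robust linear model fit of
# probe-level data.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lvs_normalize(expr, lvs_fraction=0.4, frac=0.5):
    """expr: genes x arrays matrix of log2 expression values."""
    expr = np.asarray(expr, dtype=float)
    ref = np.median(expr, axis=1)                        # pseudo-reference array
    dev = expr - ref[:, None]                            # deviation from reference
    mad = np.median(np.abs(dev - np.median(dev, axis=1, keepdims=True)), axis=1)
    lvs_idx = np.argsort(mad)[: int(lvs_fraction * expr.shape[0])]  # least-variant set

    normalized = expr.copy()
    for j in range(expr.shape[1]):
        # Fit the array-versus-reference curve on the LVS genes only, then
        # remove the estimated non-linear bias for every gene on the array.
        fit = lowess(expr[lvs_idx, j], ref[lvs_idx], frac=frac)      # sorted (x, yhat)
        bias = np.interp(ref, fit[:, 0], fit[:, 1]) - ref
        normalized[:, j] = expr[:, j] - bias
    return normalized
```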

    Case-cohort Methods for Survival Data on Families from Routine Registers

    In the Nordic countries, there exist several registers containing information on diseases and risk factors for millions of individuals. This information can be linked into families by use of personal identification numbers, and it represents a great opportunity for studying diseases that show familial aggregation. Due to the size of the registers, it is difficult to analyze the data using traditional methods for multivariate survival analysis, such as frailty or copula models. Since the size of the cohort is known, case-cohort methods based on pseudo-likelihoods are suitable for analyzing the data. We present methods for sampling control families both with and without replacement, and with or without stratification. The data are stratified according to family size and covariate values. Depending on the sampling method, results from simulations indicate that one only needs to sample 1%-5% of the control families in order to achieve good efficiency compared with a traditional cohort analysis. We also provide an application to survival data from the Medical Birth Registry of Norway.
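
    The stratified sampling step can be sketched briefly. The column names, the 5% default fraction, and the inverse sampling-fraction weights below are assumptions for this illustration; the pseudo-likelihood estimation itself is not shown.

```python
# Illustrative stratified sampling of control families for a case-cohort
# design.  Column names ('family_id', 'is_case_family', 'stratum') and the
# default 5% fraction are assumptions for this sketch only.
import pandas as pd

def sample_control_families(families, frac=0.05, replace=False, seed=1):
    """families: one row per family; 'stratum' encodes family size and covariates."""
    cases = families[families["is_case_family"]]
    controls = families[~families["is_case_family"]]
    sampled = (controls.groupby("stratum", group_keys=False)
                       .apply(lambda g: g.sample(frac=frac, replace=replace,
                                                 random_state=seed)))
    # Inverse sampling-fraction weights for a pseudo-likelihood analysis:
    # case families keep weight 1, sampled control families get 1/frac.
    return pd.concat([cases.assign(weight=1.0),
                      sampled.assign(weight=1.0 / frac)], ignore_index=True)
```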

    Detecting differential expression in microarray data: comparison of optimal procedures

    BACKGROUND: Many procedures for finding differentially expressed genes in microarray data are based on classical or modified t-statistics. Due to multiple testing considerations, the false discovery rate (FDR) is the key tool for assessing the significance of these test statistics. Two recent papers have generalized two aspects: Storey et al. (2005) introduced a likelihood ratio test statistic for two-sample situations that has desirable theoretical properties (the optimal discovery procedure, ODP) but uses standard FDR assessment; Ploner et al. (2006) introduced a multivariate local FDR (fdr2d) that allows incorporation of standard error information but uses the standard t-statistic. The relationship and relative performance of these methods in two-sample comparisons is currently unknown. METHODS: Using simulated and real datasets, we compare the ODP and fdr2d procedures. We also introduce a new procedure called S2d that combines the ODP test statistic with the extended FDR assessment of fdr2d. RESULTS: For both simulated and real datasets, fdr2d performs better than ODP. As expected, both methods perform better than a standard t-statistic with a standard local FDR. The new procedure S2d performs as well as fdr2d on simulated data, but performs better on the real datasets. CONCLUSION: The ODP can be improved by including standard error information as in fdr2d. This means that the optimality enjoyed in theory by ODP does not hold for the estimated version that has to be used in practice. The new procedure S2d has a slight advantage over fdr2d, which has to be balanced against a significantly higher computational effort and a less intuitive test statistic.
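
    As a rough illustration of the two-dimensional local FDR idea that fdr2d and S2d build on, the sketch below pairs each gene's t-statistic with its log standard error and estimates fdr as a density ratio against a label-permutation null. The Gaussian kernel smoother and the fixed pi0 are simplifications for this example, not the estimators used by the actual procedures.

```python
# Rough sketch of a two-dimensional local FDR: each gene contributes a pair
# (t-statistic, log standard error), the null is built by permuting sample
# labels, and fdr is a ratio of smoothed densities.  The kernel smoother and
# the fixed pi0 below are simplifications chosen for this illustration.
import numpy as np
from scipy.stats import gaussian_kde

def fdr2d_sketch(x, y, n_perm=50, pi0=1.0, seed=0):
    """x, y: genes x samples expression matrices for the two groups."""
    rng = np.random.default_rng(seed)

    def stat_pairs(a, b):
        se = np.sqrt(a.var(1, ddof=1) / a.shape[1] + b.var(1, ddof=1) / b.shape[1])
        return np.vstack([(a.mean(1) - b.mean(1)) / se, np.log(se)])

    obs = stat_pairs(x, y)                       # 2 x genes observed statistics
    data, n1 = np.hstack([x, y]), x.shape[1]
    null = []
    for _ in range(n_perm):                      # permutation null distribution
        idx = rng.permutation(data.shape[1])
        null.append(stat_pairs(data[:, idx[:n1]], data[:, idx[n1:]]))
    null = np.hstack(null)

    f0 = gaussian_kde(null)(obs)                 # null density at observed pairs
    f = gaussian_kde(obs)(obs)                   # overall density at observed pairs
    return np.clip(pi0 * f0 / f, 0.0, 1.0)       # two-dimensional local FDR
```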

    Super-sparse principal component analyses for high-throughput genomic data

    BACKGROUND: Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true loading vector; for example, for gene expression data we biologically expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced to reduce the number of nonzero coefficients, but the existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients. RESULTS: Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset that contains 21,225 genes. CONCLUSIONS: The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.
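
    The modified-NIPALS idea can be pictured with a generic thresholded power iteration. The relative soft-threshold below is a stand-in for the paper's random-effect penalty and singular-value shrinkage, which are not reproduced here; the function name and parameters are illustrative.

```python
# Generic sketch of a sparse leading principal component via a thresholded
# NIPALS-style iteration.  The relative soft-threshold stands in for the
# random-effect penalty and singular-value shrinkage of the actual method.
import numpy as np

def sparse_pc1(X, threshold=0.1, n_iter=200, tol=1e-8, seed=0):
    """X: samples x genes data matrix.  Returns (scores, sparse loadings)."""
    X = X - X.mean(0)                             # column-center
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = X @ v                                 # update scores given loadings
        w = X.T @ u                               # update loadings given scores
        w = np.sign(w) * np.maximum(np.abs(w) - threshold * np.abs(w).max(), 0.0)
        norm = np.linalg.norm(w)
        if norm == 0.0:                           # threshold removed everything
            break
        w /= norm
        if np.linalg.norm(w - v) < tol:           # converged
            v = w
            break
        v = w
    return X @ v, v
```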

    Correlation test to assess low-level processing of high-density oligonucleotide microarray data

    BACKGROUND: There are currently a number of competing techniques for low-level processing of oligonucleotide array data. The choice of technique has a profound effect on subsequent statistical analyses, but there is no method to assess whether a particular technique is appropriate for a specific data set without reference to external data. RESULTS: We analyzed coregulation between genes in order to detect insufficient normalization between arrays, where coregulation is measured in terms of statistical correlation. In a large collection of genes, a random pair of genes should have on average zero correlation, hence allowing a correlation test. For all data sets that we evaluated and for the three most commonly used low-level processing procedures (MAS5, RMA and MBEI), the housekeeping-gene normalization failed the test. For a real clinical data set, RMA and MBEI showed significant correlation for absent genes. We also found that a second round of normalization on the probe-set level improved normalization significantly throughout. CONCLUSION: Previous evaluation of low-level processing in the literature has been limited to artificial spike-in and mixture data sets. In the absence of a known gold standard, the correlation criterion allows us to assess the appropriateness of low-level processing of a specific data set and the success of normalization for subsets of genes.
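
    The correlation criterion can be sketched directly: sample random gene pairs, compute their correlation across arrays, and check that the average is near zero. The one-sample t-test and the number of pairs used below are assumptions for this illustration rather than the paper's exact test.

```python
# Sketch of the random-pair correlation check: in well-normalized data,
# randomly paired genes should be roughly uncorrelated across arrays, so a
# mean correlation away from zero points to residual array-level effects.
# The one-sample t-test used here is a simplification for illustration.
import numpy as np
from scipy import stats

def correlation_test(expr, n_pairs=5000, seed=0):
    """expr: genes x arrays matrix of normalized log expression."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, expr.shape[0], n_pairs)
    j = rng.integers(0, expr.shape[0], n_pairs)
    keep = i != j                                        # drop self-pairs
    centered = expr - expr.mean(1, keepdims=True)
    sd = centered.std(1)
    r = (centered[i[keep]] * centered[j[keep]]).mean(1) / (sd[i[keep]] * sd[j[keep]])
    tstat, p = stats.ttest_1samp(r, 0.0)                 # mean correlation vs zero
    return r.mean(), p
```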